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Abstract 

Background: A major goal of molecular evolution is to determine how natural selection has shaped the evolution of 
a gene. One approach taken by methods such as K/\/Ks and the McDonald-Kreitman (MK) test is to compare the 
frequency of non-synonymous and synonymous changes. These methods, however, rely on the assumption that a 
change in frequency of one mutation will not affect changes in frequency of other mutations. 

Results: We demonstrate that linkage between sites can bias measures of selection based on synonymous and 
non-synonymous changes. Using forward simulation of a Wright-Fisher process, we show that hitch-hiking of 
deleterious mutations with advantageous mutations can lead to overestimation of the number of adaptive 
substitutions, while background selection and clonal interference can distort the site frequency spectrum to obscure 
the signal for positive selection. We present three diagnostics for detecting these effects of linked selection and apply 
them to the human influenza (H3N2) hemagglutinin gene. 

Conclusion: Various forms of linked selection have characteristic effects on MK-type statistics. The extent of 
background selection, hitch-hiking and clonal interference can be evaluated using the diagnostic statistics presented 
here. The diagnostics can also be used to determine how well we expect the MK statistics to perform and whether 
one form of the statistic may be preferable to another. 

Keywords: Ka/Ks, McDonald-Kreitman, Linkage, Background selection, Hitch-hiking, Clonal interference 



Background 

Understanding the mechanisms by which natural selec- 
tion shapes the evolution of genes is one of the major aims 
of molecular evolution. One commonly used approach 
for the detection of positive selection in protein-coding 
sequences is based on comparing the frequency of non- 
synonymous or amino-acid (A) changes to the frequency 
of synonymous (S) changes [1]. For simplicity, synony- 
mous nucleotide changes that do not affect the protein are 
generally assumed to be neutral. In the absence of selec- 
tion and accounting for the genetic code, we expect both 
types of changes to be equally probable so that the rate of 
non-synonymous substitutions per site (Ka) is equal to the 
rate of synonymous substitutions per site (Ks)> a ratio of 
Ka/Ks > 1 indicates positive selection favouring a change 
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in the protein [2], However, this test is heavily conser- 
vative as proteins are generally under negative selection 
against amino acid changes that may affect protein func- 
tion. Positive selection at a small number of sites may be 
masked by negative selection removing non-synonymous 
changes in the rest of the protein [3]. 

The McDonald-Kreitman (MK) test [4] attempts to 
account for the presence of negatively selected sites by 
comparing Ka/Ks to / the proportion of nearly neutral 
sites in the sequence [5]. If selection is strong, deleteri- 
ous and beneficial mutations are expected to make little 
contribution to polymorphism; deleterious mutations are 
removed by selection and beneficial mutations reach fix- 
ation rapidly. Polymorphic sites are expected to consist 
largely of neutral variation, and the ratio of the num- 
ber of neutral non-synonymous polymorphic sites (Pa) to 
the number of synonymous polymorphic sites {P$) can 
be used as an estimator of / [6]. In the MK test, posi- 
tive selection is inferred when Ka/Ks > Pa/Ps- Following 
similar reasoning, Ka/Ks measured in a related sample 



© 201 3 Chan et at; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative 
BlOlVlGCl CGntr8l Commons Attribution License (http://creativecommons.Org/licenses/by/2.0), which permits unrestricted use, distribution, and 
reproduction in any medium, provided the original work is properly cited. 



Chan etal. BMC Evolutionary Biology 201 3, 13:244 
http://www.biomedcentral.com/1471-2148/13/244 



Page 2 of 15 



can be used as a measure of selective constraint so that an 
increase in the Ka /Ks ratio implies positive selection [7,8] . 

With the increasing availability of sequence data, vari- 
ous modifications of Ka/Ks methods have been developed 
to quantify the prevalence [6], strength [9,10] and dynam- 
ics of positive selection [11,12]. These methods rely on 
the assumption that sites segregate independently; that 
is, the change in frequency at one site will not affect the 
change in frequency at another site. In a large population 
with a high mutation rate, however, multiple mutations 
co-occur in the population and the change in frequency 
of one mutation also depends on selection acting at linked 
sites. Depending on the type of selection, linkage can have 
different effects; background selection, hitch-hiking and 
clonal interference can both increase or decrease fixa- 
tion probability or polymorphism frequency relative to 
expected levels, which we describe below. 

Background selection is the reduction in genetic vari- 
ability caused by linkage to negatively selected sites [13]. 
The effect of background selection on the probability of 
fixation is qualitatively similar to a reduction in effec- 
tive population size [13-16], which implies a higher than 
expected value of Ka/Ks under negative selection and 
a lower than expected value of Ka/Ks under positive 
selection relative to expectations under independently 
segregating sites [14]. Background selection also reduces 
the number of neutral polymorphic sites [17], and can 
result in a non-monotonic site-frequency spectrum, sim- 
ilar to the effect of continual adaptation [18,19]. Linkage 
between sites introduces dependencies in the site fre- 
quency spectrum, increasing the covariance even if the 
mean is unchanged [20]. Recent work with the struc- 
tured coalescent [21] in a model of only negative selection, 
provides analytical expressions for the number of both 
neutral and deleterious mutations showing that the effec- 
tive population size varies, both going back in time, and 
between individuals in different fitness classes. 

When both positive and negative selection operate on 
a locus, the dynamics of linked neutral and deleteri- 
ous mutations will also be affected by hitch-hiking [22]. 
Birky and Walsh [14] showed that hitch-hiking does not 
affect the fixation probability at neutral sites but increases 
the fixation probability at negatively selected sites, which 
implies that I<a/K$ values are elevated relative to expec- 
tation under independently segregating sites. For the MK 
statistic, the effect of hitch-hiking depends on its effect 
on polymorphism relative to its effect on divergence. The 
effect of hitch-hiking on neutral polymorphism has been 
described by Braverman et al. [23], but has not been 
characterised on a selected background. Previous find- 
ings [23-25] were largely based on coalescent simulations 
which allow only a small number of sites to be under 
selection and model the trajectory of beneficial mutations 
deterministically. Forward simulation studies [14,26-28] 



which begin with a number of positively selected sites 
and evolve towards mutation-selection equilibrium show 
that linkage affects a number of frequency-based statistics 
including Tajima's D and heterozygosity. 

Clonal interference (interactions between positively 
selected mutations) has also been predicted to reduce the 
fixation probability of beneficial mutations and promote 
the fixation of deleterious mutations; this was demon- 
strated in several experimental systems [29,30]. More 
recently, theoretical models assuming continual adapta- 
tion with a high supply of beneficial mutations have 
been used to obtain analytical expressions characterising 
genetic diversity. These models predict a non-monotonic 
site frequency spectrum with a large number of both low 
and high-frequency mutations [18,19,31]. This is equiva- 
lent to large number of lineages coalescing simultaneously 
and is often described as multiple-mergers [18,19,31]. 

Here, we examine the joint effects of background 
selection, hitch-hiking and clonal interference on the 
Ka/Ks and MK statistic. Based on theoretical studies 
[18,19,21,31], we expect different forms of distortion in 
the site-frequency spectrum due to these effects. Previ- 
ous simulation studies [14,26,27,32] have often considered 
these effects together, but here we distinguish between 
them by allowing both the strength of selection and the 
level of interference to vary. We do this using forward 
simulations with finite sites, allowing positive selection to 
occur at different times. Finally, we propose three diag- 
nostic statistics to indicate the degree to which (a) hitch- 
hiking of deleterious mutations (b) background selection 
and (c) clonal interference affect a sample of protein- 
coding sequences. 

Results 

The effect of background selection 

We begin by examining the effect of negative selection 
and linkage without positive selection in a protein-coding 
region of 500 codons evolving under a Wright-Fisher pro- 
cess. Negative selection is described by the distribution of 
fitness effects (DFE) of non-synonymous changes, which 
are specific to each codon site. The DFE is modelled using 
a gamma distribution where a large value of the shape 
parameter corresponds to a higher proportion of weakly 
deleterious mutations. 

The effect of background selection on the &> = Ka/Ks 
statistic is shown in Figure 1. The density of estimators 
with linked selection computed using Equation (19) is 
shown in solid lines, whereas the corresponding values 
obtained with independently segregating sites from PRF 
simulations are shown with dashed lines. Both simula- 
tions account for the contribution of segregating polymor- 
phisms. The effect of linkage, therefore, is shown by the 
difference between simulations with linkage and without 
linkage. As expected, the effect of background selection in 
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Figure 1 Distribution of a>. Distribution of u> under only negative selection for DFE shape parameters p = 0.25,0.5, 1,2. Solid curves indicate 
simulation results under complete linkage and dashed curves indicate results based on independently segregating sites using the PRF. Distributions 
were calculated from 1 00 sequences sampled at 6N generations with 500 replicates. 



reducing a> increases with /J and u. Our simulations also 
show that linkage increases the variance of the estimator 
due to correlations between linked sites. This is particu- 
larly evident for u = 10 -5 where the distribution of d> 
visibly broadens with increasing p. 

In Figure 2, we consider three forms of the MK statistic: 
(i) the uncorrected estimator o-mk (Equation 21), and (ii) 
Fay's corrected estimator ap (Equation 22) which removes 
low-frequency polymorphisms to reduce the effect of seg- 
regating deleterious polymorphisms and (iii) Bhatt's cor- 
rected estimator ag (Equation 23) which removes both 
low and high frequency polymorphisms that are likely 
to contain deleterious and beneficial mutations. In the 
absence of positive selection, we expect ap and as to per- 
form similarly, and this is indeed seen for u = 10 -6 . 
However, for simulations with a higher mutation rate 
and correspondingly larger effect of background selec- 
tion, discrepancies occur between the two statistics due 
to an increase in the number of high-frequency polymor- 
phisms. Unlike w, the variance of the MK statistics does 
not seem to be affected by linkage. In fact the performance 
of the MK statistics (in the absence of positive selection) is 
slightly improved by background selection which removes 
weakly deleterious mutations. 

The combined effect of background selection, clonal 
interference and hitch-hiking 

In the following section, we examine the combined effect 
of negative and positive selection. Positive selection is 
introduced at a fixed number of sites at intervals of r 
generations throughout the simulation, where all posi- 
tively selected sites have the same selective coefficient 
Sb- Decreasing r increases the probability of interfering 
positive sweeps. A comparison of the effects of differ- 
ent selective conditions on the site frequency spectrum 
is shown in Figure 3. Note that these curves represent 
averaged levels of polymorphisms, and dynamics can vary 
rapidly over time (see Additional file 1: Figures S3-20). 



We show results for low levels of background selec- 
tion (small u) in the left column and results for high 
levels of background selection in the right column. The 
(unsealed) synonymous site frequency spectrum is shown 
in the top row. When the effect of background selec- 
tion is small, the synonymous site frequency spectrum 
is close to the expectation under independently segre- 
gating sites (6/i; black dashed lines). Background selec- 
tion (bold grey lines) reduces the level of synonymous 
variation, particularly at medium frequencies, leading to 



CD 

a 



P = 0.25 , u = 10" 6 

' \ 1 

i /*bi 
I / if If 
' 1 '* 41 
I /' //. 
I /* II 


A 

a MK 

A 

A — 3 F 

\ HVi 
\ <\Vl 

V 1 v\ 

V 1 \\ 
\ \ \^ 







Q) O 

Q 



-15 -10 -5 0 


5 10 


A 


a 


(3 = 0.25, u = 10 -5 




I \ /I y; I \ 
/ \ A-'"* Y \ 












' /» yf \ 




/ '7 X/ V 











-60 -40 



-20 



20 



Figure 2 Distribution of the MK statistics. Distribution of the MK 
statistics under only negative selection. Results are shown for 
simulations with complete linkage (solid lines) and independently 
segregating sites (dashed lines) for different DFEs and mutation rates. 
The true number of adaptive substitutions (zero) is indicated by the 
dotted vertical line. 
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Figure 3 The effect of linkage on the site frequency spectrum. The synonymous site frequency spectrum (top row) and the ratio of 
non-synonymous to synonymous frequency spectrum (bottom) is shown for B = 0.25 with mutation rates u = 10~ 6 and 1 0 5 . Al I curves are 
averaged over 500 replicates, under conditions of only negative selection (grey), and different conditions of positive selection (coloured lines). Black 
dashed lines show the expected behaviour of the neutral site frequency spectrum under independently segregating sites ((?//) and black dotted 
lines indicate the leading order behaviour expected under constant adaptation (8/i 2 ). In the bottom panels, solid lines show the average 
non-synonymous to synonymous ratio for only negatively selected sites, whereas dashed lines show the ratio across both positively and negatively 
selected sites. 



a non-monotonic distribution, but the effect is not as 
severe as clonal interference. Linked positive selection fur- 
ther reduces polymorphism levels; a slow rate of sweeps 
with strong selection (orange lines) primarily affects high- 
frequency mutations while a high supply of weak positive 
selection (green lines) results in smaller levels of reduc- 
tion at both low and high frequencies. When both the 
supply rate and the strength of positive selection is strong 
(pink lines), the synonymous site frequency spectrum 
approaches 0/i 2 (black dotted line), which is the leading 
behaviour predicted for continual adaptation [19]. 

To examine how linkage affects selected mutations, we 
compare the ratio of the averaged frequency spectra for 
non-synonymous (A) and synonymous (S) sites (Figure 3, 
bottom row). The A/S ratio in the absence of positive 



selection is indicated by the bold grey line, whereas the 
A/S ratio for deleterious sites linked to positively selected 
sites is shown by coloured solid lines. The discrepancy 
between the grey and coloured lines reflects the effect of 
hitch-hiking; there is a slight increase in the A/S ratio at 
high-frequencies due to hitch-hiking. Note that the actual 
number of deleterious polymorphisms is reduced relative 
to simulations with no positive selection (Additional file 1: 
Figure SI) but the number of synonymous polymorphisms 
is reduced by a relatively greater proportion. 

Comparing the A/S ratio with (dashed coloured lines) 
and without (solid coloured lines) beneficial mutations, 
it can be seen that beneficial mutations can segregate at 
all frequencies when the supply rate is high (green and 
pink lines), but mutations segregating at high frequencies 
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tend to include more beneficial mutations. Comparison 
of the two panels in the bottom row also shows that 
higher levels of background selection increase the effect of 
both hitch-hiking (solid coloured lines) and clonal inter- 
ference (dashed coloured lines), as distortions in the site- 
frequency spectrum tend to occur over a wider range of 
frequencies. Similar results are seen for larger values of /3 
with more pronounced reductions of synonymous poly- 
morphism due to background selection, and changes in 
the A/S ratio due to hitch-hiking and clonal interference 
are spread across a broader frequency range (Additional 
file 1: Figure SI). 

The contributions of background selection, hitch-hiking 
and clonal interference result in qualitatively different 
behaviour in the site-frequency spectrum, and this in turn 
causes characteristic types of bias in the various forms of 
the MK statistic. This is summarised in Figure 4, where 
we compare the performance of different forms of the 
MK statistic in estimating the true number of beneficial 
mutations in each simulation. Here, we do not consider 
the uncorrected umk as it was severely biased in all the 
simulations we examined. An additional MK statistic, au 
is considered which uses divergence information from 



simulations with no positive selection instead of estimat- 
ing selective constraint from polymorphism information. 
Comparison of ap or as against ao, therefore, shows how 
much of the bias is due to incorrect estimation of selective 
constraint. 

The different panels in Figure 4 correspond to dif- 
ferent combinations of positive and negative selection: 
low levels of background selection (due to strong nega- 
tive selection) and infrequent positive sweeps (top left), 
low levels of background selection and frequent positive 
sweeps (bottom left), high levels of background selection 
and infrequent positive sweeps (top right) and high lev- 
els of background selection with frequent positive sweeps 
(bottom right). When the effect of background selection 
is large (top right), both ap and as tend to underestimate 
the true number of adaptive substitutions. For low levels 
of background selection or frequent positive sweeps, the 
effect of hitch-hiking (controlled by /3) and the amount of 
clonal interference (using the observed number of bene- 
ficial mutations as a proxy) has a consistent effect on the 
MK statistics. For small values of /J so that low levels of 
hitch-hiking occur, ag has smaller bias than ap. However, 
for high levels of hitch-hiking ap is less biased, particularly 
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Figure 4 Effect of background selection, hitch-hiking and clonal interference on the MK statistics. Lines indicate the median relative error of 



different forms of the MK statistics for 55 : 
and clonal interference. 
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when clonal interference is low. Results for different val- 
ues of Sh were qualitatively similar but with larger relative 
error for weaker positive selection. 

The reason for these biases is intuitively clear from the 
site frequency spectrum, ag differs from uf only in that 
it does not use polymorphism data at high frequency. 
Therefore, as is more robust against clonal interference 
(Figure 4, bottom row) as beneficial mutations are more 
likely to segregate at high frequencies. However, when 
weakly deleterious effects are prevalent (Figure 4, solid 
pink lines), «g is upwardly biased as it does not account for 
the relaxation of selective constraint due to positive selec- 
tion. This is confirmed by the similar values obtained for 
ag and «d, suggesting that removal of high and low fre- 
quency polymorphisms in the context of linked selection 
has a similar effect to that expected under independently 
segregating sites, namely the removal of both positively 
and negatively selected mutations. The correction of Bhatt 
et al. [33] does not perform well when there are high 
levels of background selection as distortions in the site 
frequency spectrum are spread across a wider range of 
frequencies than without background selection. 

Diagnostics for linkage effects 

In the previous section, we showed that much of the bias 
in the comparative estimators can be explained in terms 
of background selection, hitch-hiking and clonal inter- 
ference. In order to detect these effects using samples 
of protein-coding sequences, we construct and examine 
three diagnostic statistics. 

The first diagnostic tests for an excess of low frequency 
non-synonymous polymorphisms relative to medium fre- 
quency polymorphisms. For a sample size of «, we con- 
sider a mutation to occur at low frequency if it occurs 
i times in the sample, where i belongs to the set Xi = 
{1,2,..., [0.1 5m] —1} and square brackets indicate round- 
ing to the nearest integer. Charlesworth and Eyre-Walker 
[34] showed that the majority of deleterious polymor- 
phisms occurred in this frequency range even when the 
sample size is varied. Similarly, we consider a muta- 
tion to occur at medium frequencies if the number 
of times it occurs in the sample belongs to Im = 
{[0.15w] , [0.15m] +1, . . . , [0.75m] }. The first diagnostic is 
given by 



Eieli PS (0 + 1 T, ie x M PS (0 + 1 ' 



(1) 



If weak deleterious effects are rare, then we expect that 
most deleterious mutations are immediately removed 
from the population. In this case, most polymorphisms 
would be selectively neutral and we would expect that 
the ratio of non-synonymous to synonymous polymor- 
phisms, at any frequency range, is simply determined by 
the mutational bias. The difference of the two ratios in D\ 



is therefore expected to equal zero in the absence of weak 
deleterious effects and large values are indicative of a high 
frequency of weak deleterious mutations, which results in 
susceptibility to hitch-hiking. 

In Figure 5, we show the correlation between D\ and the 
amount of hitch-hiking, which we measure as the relative 
excess of non-synonymous substitutions at non-beneficial 
sites in simulations with positive selection compared to 
simulations with no positive selection. A value of 1.0 in the 
x-zxas corresponds to half of all non-synonymous substi- 
tutions being due to hitch-hiking. When positive selection 
is weak so that ag < 0 (open circles), D± correlates with 
the p shape parameter so that values of D\ > 0 indi- 
cate susceptibility to hitch-hiking. When strong positive 
selection occurs, selective constraint is reduced so that the 
proportion of mutations that can be considered weakly 
deleterious may be increased. In this case, we see that 
D\ is also increased, even for small values of fi. Inter- 
pretation of the D\ statistic, therefore, should depend on 
both the value of D\ and the MK statistic. We use as 
here as Figure 4 indicates that it is less likely to result in 
underestimation than . 

The second diagnostic tests for an excess of high fre- 
quency polymorphisms which is an indication of multi- 
ple merger events [18,31] due to interfering mutations 
that can be either negatively (background selection) or 
positively(clonal interference) selected. We compare the 
number of high frequency polymorphisms to medium fre- 
quency polymorphisms, where a mutation is defined to be 
at high frequency if the number of times it occurs in the 
sample belongs to Ih = {[0.75m] +1, ...,« — 1} and \x\ 
denotes the number of elements in the set x, 
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(2) 



Deleterious mutations are not expected to persist to 
medium frequencies, so polymorphisms at medium and 
high frequencies can be assumed to be neutral or benefi- 
cial. Under assumptions of neutrality and independently 
segregating sites, the expected number of polymorphic 
sites that occur at frequency i is given by E(pa(l)) = 6a/U 
where 8a = 2uNLc/(c + 1), giving an expectation of 
Z) 2 = 0. Values of D 2 < 0 can, therefore, indicate strong 
linkage effects due to an excess of beneficial or deleterious 
mutations. 

A third statistic can distinguish between the effect of 
background selection and clonal interference, 
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Figure 5 Diagnostic for hitch-hiking. Circles and bars indicate median and interquartile ranges from all combinations of recurrent sweeps with 
r = 10 3 , 1 0 4 , Sb = 1 CT 3 , 1 (T 2 , u = 1 CT 5 , 1 0~ 6 and p = 0.25, 0.5, 1 , 2. Parameter combinations which give a median 5g < 0 are denoted with an 
open circle whereas simulation conditions which result in 3g > 0 are shown with closed circles. The x-axis is the relative excess of non-synonymous 
substitutions due to linked positive selection, calculated as (k' — k)/k averaged across all time points after 40000 generations, where k is the mean 
number of non-synonymous substitutions at non-beneficial sites averaged across 500 simulations with no positive selection and the prime 
indicates the corresponding values in a simulation with both positive and negative selection. 



This statistic tests for an excess of high-frequency non- 
synonymous polymorphisms relative to both medium 
frequency non-synonymous polymorphisms and high- 
frequency synonymous polymorphisms. As with D\ and 
Di, the expectation under independently segregating neu- 
tral sites is D3 = 0 and values of D% > 0 are indicative 
of clonal interference. In Figure 6, values of Di and D3 
are shown for varying levels of background selection and 
clonal interference. In the left panel, low mutation rates 
generate only low levels of background selection and val- 
ues of D2 and£>3 are strongly correlated, as both are due to 
clonal interference. In the right panel, a high mutation rate 



increases levels of both background selection and clonal 
interference. Simulations with a high supply rate of bene- 
ficial mutations (filled red circles) have large values of D3 
and strongly negative D2 values, whereas simulations with 
a low supply rate of beneficial mutations and occasional 
instances of clonal interference tend to small positive val- 
ues of D3 with negative values of D2 (filled blue circles). 
When only high levels of background selection are acting, 
both .D3 and D2 fall below zero (open black circles). The 
behaviour of these three diagnostics are similar for differ- 
ent sample sizes (Additional file 1: Figure S2) and different 
population sizes (Additional file 1: Figures S15-20). 
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Figure 6 Diagnostics for clonal interference and background selection. Median values of D2 and D3 are shown for all combinations of 
Sf, = 10~ 3 , 10~ 2 and /} = 0.25,0.5, 1,2 for all time points after 40000 generations. Bars represent interquartile ranges for D3. In the left panel, 
negative values of D2 are mostly due to clonal interference but in the right panel, negative values of D2 are caused by a combination of clonal 
interference and background selection. 
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In Figure 7, we show that the bias of ap and ag varies 
systematically with D% (clonal interference) and D\ (hitch- 
hiking). Larger values of D\ and D3 tend to result in larger 
values for both statistics; for ap this tends to reduce the 
magnitude of the bias, but increases bias for ag. This sug- 
gests that ap performs better for large D\ but o-b performs 
better for large D3 and small D\. The size of the bias for 
both statistics is larger for higher mutations rates (bottom 
row, u = 10~ 5 ) which corresponds to very large D2 val- 
ues (Figure 6) and larger effects of background selection. 
In particular, when D3 < 0 and D2 0, both statistics are 
expected to heavily underestimate the amount of positive 
selection that has occurred. 

To evaluate whether D\, D2 and D3 differ from zero, 
we use a non-parametric bootstrap, recalculating statis- 
tics after resampling with replacement from the original 



sequence sample. The scaling factor for mutation bias 
c, which is omitted from D\, is automatically accounted 
for by this method. Confidence intervals for D\ were 
constructed from the bootstraps using the 2.5 to 97.5 
percentiles. As D2 is slightly biased, confidence intervals 
for £>2 and D3 were constructed using the BCA method 
provided in R [35]. 

Application of diagnostics to human influenza A (H3N2) 

We applied the diagnostics with the bootstrap method to 
the human influenza A (H3N2) dataset used by Strelkowa 
and Lassig [36]. The dataset comprises 2030 sequences 
with a length of 330 codons spanning 1968-2007. The 
list of accession numbers is provided in the Additional 
file 1 in [36]. Following Strelkowa and Lassig [36], we 
used A/Bilthoven/16190/1968 as the ancestral sequence; 
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Figure 7 Correlation between diagnostics and bias of the MK statistics. Each point in the plot represents a single simulation replicate with the 
value of bias of hp (left column) and 3g (right column) indicated by the colour of the point. Results are shown for u = 1CT 6 (top row) and u = 10~ 5 
(bottom row), and each panel consists of 100 replicates from all combinations of recurrent sweeps with x = 10 3 , 10 4 , 5b = 10~ 3 , 10~ 2 and 
fS = 0.25,0.5, 1,2. Simulations with bias outside the range of the colour scale were set at the extreme values and points with less than two medium 
frequency polymorphisms were excluded. 
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results using A/Hong Kong/1/1968 were very similar. 
Diagnostics D\ and£>2 were computed for samples in each 
year separately, with sample sizes ranging from 5 to 215. 
The results are shown in Figure 8. There is some vari- 
ation over time, with wider confidence intervals in the 
earlier samples due to small sample sizes, but D\ values 
are mostly centred around zero, suggesting low levels of 
hitch-hiking. Hitch-hiking cannot be conclusively ruled 
out as confidence intervals are quite wide and a number 
of points reach D\ = 1. However, values of D\ remain 
consistently low with a number of time points falling 



below zero which is more consistent with a low hitch- 
hiking scenario. In contrast, simulations with prevalent 
hitch-hiking tend to to have confidence intervals that are 
consistently above zero and point estimates much higher 
than 0.5 (Additional file 1: Figures S3-9). Values of D2 are 
strongly negative, indicating a strong effect due to inter- 
fering deleterious or beneficial mutations; the magnitude 
of D2 is consistent with a high level of background selec- 
tion. Multiple time points with D3 » 0 also suggests 
that clonal interference frequently occurs in the evolution 
of H3N2. 
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Discussion 

It has long been known that linkage influences polymor- 
phism frequencies and fixation probabilities which can 
adversely affect methods that assume independent seg- 
regation of sites. The MK statistic, which compares the 
number of polymorphic sites rather than using only fre- 
quency information, is generally considered to be more 
robust to linkage effects than frequency-based statistics 
[20,27]. In this study, we show that the MK statistic can be 
affected, depending on the mode of linked selection and 
its characteristic effect of the site-frequency spectrum. 

When background selection has a large effect, distor- 
tions in the site-frequency spectrum result in a downward 
bias in the MK statistics. However, when the effect of neg- 
ative selection is low compared to positive selection, the 
performance of the MK statistics as a quantitative estima- 
tor of the number of adaptive substitutions is determined 
by the extent of hitch-hiking and clonal interference. 
When negative selection is strong so that levels of hitch- 
hiking are low, ag tends to perform better. Specifically, it 
is more robust against distortions of the site frequency 
spectrum at high frequencies caused by background selec- 
tion or clonal interference. However, hp is more robust to 
hitch-hiking which occurs when weak negative selection 
is pervasive. 

Our results are consistent with that of a recent study by 
Messer and Petrov [32] showing that ap tends to under- 
estimate the number of adaptive substitutions. We find 
this primarily occurs when background selection has a 
large effect and positive selection is infrequent. However, 
when positive selection is strong, hitch-hiking can also 
lead to overestimation as suggested in some empirical 
studies [37]. When interactions between deleterious poly- 
morphisms dominate the dynamics of the populations, 
the asymptotic correction proposed by Messer and Petrov 
[32] can be used to correct for underestimation due to low 
and medium frequency deleterious polymorphisms. This 
method corrects for deleterious mutations, as the relative 
abundance of deleterious mutations is reduced at higher 
frequencies, but cannot be applied for beneficial muta- 
tions which have increased relative abundance at higher 
frequencies. 

Here, we show that, when background selection is rel- 
atively weak, choosing the appropriate form of the MK 
statistic can reduce estimation bias. Messer and Petrov 's 
[32] results apply for organisms with large genomes and 
many weakly deleterious mutations, but when genomes 
are small and selective effects are broadly distributed, as 
is the case viral populations [38,39], the considerations 
raised in this study apply. 

Understanding the effects of linked selection also 
affects our interpretation of these estimators. The num- 
ber of adaptive substitutions cannot be directly related to 
either the strength of selection or the supply of beneficial 



mutations, but it is a combination of both of these factors. 
For example, Strelkowa and Lassig [36] and Koelle et al. 
[40] raised alternative hypotheses concerning whether 
periodic positive sweeps in human influenza were due to 
a limiting supply of beneficial mutations, or by a high sup- 
ply rate with competition between beneficial mutations 
limiting the fixation rate. 

The selective regime is important, then, for both the 
application and interpretation of these estimators. We 
present three statistics for evaluating the effects of linked 
selection. D\ signals the presence of weak deleterious 
mutations that can potentially cause hitch-hiking and is 
based on identifying an excess of non-synonymous low- 
frequency polymorphisms. More sophisticated methods 
to characterise the DFE have been used [9,41], but these 
methods rely on a number of assumptions and have given 
conflicting results. Consequently, it is useful to have a 
simple diagnostic that flags when hitch-hiking might be 
an issue. We have not attempted to use standard site- 
frequency based indicators of hitch-hiking (e.g [42]) which 
test for an excess of low and high frequency polymor- 
phisms. As demonstrated by Kim [43], the excess of 
high frequency polymorphisms may not occur in recur- 
rent sweeps. In addition, this effect can be complicated 
by clonal interference. If no comparative information is 
available, the excess of low frequency polymorphisms 
cannot be distinguished from a model of population 
growth [44]. 

Our second diagnostic, D2 detects an excess of high- 
frequency non-synonymous polymorphisms signalling 
strong linkage effects on the population, either due 
to a large number of deleterious mutations causing 
background selection, or a large number of beneficial 
mutations causing clonal interference. In both cases, MK 
statistics are biased and estimators assuming indepen- 
dently segregating sites must be interpreted with care. We 
can distinguish between the effects of background selec- 
tion and clonal interference by using a third statistic, D3. 
The diagnostic for clonal interference presented here fol- 
low a similar reasoning to the method used by Strelkowa 
and Lassig [36] in testing for an excess of high frequency 
non-synonymous polymorphisms. Our method has more 
general applicability as it accounts for the effect of dele- 
terious mutations, and can be used even when it is not 
known which region of the sequence is positively selected. 

We have applied these diagnostics to the dataset used 
by Strelkowa and Lassig [36]. Our results provide fur- 
ther support for their conclusion that clonal interfer- 
ence occurs in human influenza A. The authors also 
raised the question of whether strong and frequent pos- 
itive selection would promote the fixation of deleterious 
mutations, increasing the brittleness of the protein. The 
values obtained for D\, however, suggest that strong neg- 
ative selection is predominant, so that hitch-hiking in 
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the HA1 region is rare; this is in agreement with Shih 
et al. [45], who showed that few non-synonymous sub- 
stitutions occurred outside antigenic epitopes. It is also 
consistent with a phylogenetic study by Nielsen and Yang 
[9] that estimated the DFE shape parameter in that 
region as 0.373, indicative of low sensitivity to hitch- 
hiking in our model. The combination of clonal interfer- 
ence and robustness against hitch-hiking suggests that the 
modification used by Bhatt et al. [33] is appropriate for 
application to the HA1 region. 

In this study, we have not considered the effect of 
population size changes, which are known to affect site- 
frequency based statistics. However, we expect D\ and 
-D3 to be relatively robust, as they are based on compar- 
isons between the non-synonymous and synonymous site 
frequency spectra [32]. Population expansions, which are 
expected to have the strongest effect on low-frequency 
mutations [46], should have minimal effect on Di and 
-D3. Population bottlenecks, however, can remove medium 
frequency polymorphisms, potentially elevating the mag- 
nitude of both D2 and D3 and giving false positives for 
clonal interference. We have also not examined the effect 
of selection against synonymous mutations. We expect, 
however, that D\ and D3 should not be strongly affected as 
long as selection against synonymous mutations is weaker 
than against non-synonymous mutations. D2 does not use 
information from the synonymous site frequency spectra 
and should not be affected by selection against synony- 
mous mutations, but negative values of D2 may also result 
from background selection at synonymous sites. These 
effects could be considered in more detail in future simu- 
lation studies. 

Conclusions 

We have shown that linked selection is responsible for 
biases in the MK statistics, causing underestimation when 
there are high levels of interference between selected 
mutations, and overestimation for strong non-interfering 
sweeps. The statistics presented in this study can be 
applied to samples of protein-coding sequences to eval- 
uate the influence of linked selection for the parameter 
range studied here. Values of D\ that are significantly 
greater than zero signal susceptibility to hitch-hiking 
while values of D2 significantly smaller than zero indi- 
cate the occurrence of multiple mergers. Multiple mergers 
due to clonal interference can be distinguished from back- 
ground selection when D3 > 0. 

Based on our simulations, when D2 < 0, D3 > 0 and 
Di « 0, we recommend using a statistic such as as, which 
excludes both low- and high-frequency polymorphisms. 
On the the hand, when high values of D\ (signalling hitch- 
hiking) are obtained, it is more appropriate to use Uf 
which uses medium and high-frequency polymorphisms, 
accounting for change in selective constraint due to 



hitchhiking. In cases where D2 <5C 0 and D3 <= 0, both 
«f and ag are expected to perform poorly. 

Methods 

Simulation of sequence evolution under linkage 

We simulate the evolution of a population, represented 
as a sequence of length L = 500 codons (nucleotide 
triplet). Each codon site is associated with a selection coef- 
ficient, Sd, which is drawn from the distribution of fitness 
effects (DFE; see Distribution of deleterious effects, below). 
The DFE affects both the extent of background selection 
and hitch-hiking. To model a well-adapted population, 
each simulation is initialised so that all non-synonymous 
changes from the ancestral sequence are negatively 
selected, reducing fitness by a factor of 1 — S4. All syn- 
onymous changes are neutral. Throughout the simulation, 
positive selection is introduced at a specified number 
of sites at fixed times. After the introduction of posi- 
tive selection, an individual carrying a non-synonymous 
change from the ancestral sequence at the positively 
selected site undergoes a change of fitness by a factor of 
1 + Sb- The timing of the introduction of positive selec- 
tion and the strength of selection (see Positive selection, 
below) control the extent of clonal interference. The 
extent of hitch-hiking is determined by the interaction 
between the DFE and positive selection. 

Each simulation is initialised with a haploid population 
of N = 10 4 monomorphic individuals. The mutation pro- 
cess follows a Kimura two-parameter model [47], with the 
transition-transversion ratio fixed at k = 3. Ancestral 
sequences are generated randomly assuming that the base 
frequency of all 61 non-stop codons are equal, and all 27 
one-step mutations at a codon are allowed. For k > 1, the 
mutation probabilities are not equal. Individuals carrying 
stop-codons have fitness set to zero. 

In each generation, the total number of mutations intro- 
duced into the population follows a Poisson distribution 
with mean uNL, where the mutation rate per site per 
generation is u = 10 -6 or u = 10 -5 and occurs uni- 
formly across all sites and all sequences. We assume 
non-overlapping generations and individuals reproduce 
by multinomial sampling with probability proportional to 
their fitness, as in a Wright-Fisher process. 

Distribution of deleterious effects 

The selection coefficient at each site is drawn from a con- 
tinuous distribution of fitness effects (DFE), which we 
model using the gamma distribution following previous 
studies [9,34,41], 



p(x,p,s) = 



r(/J) 



(4) 



where ft is the shape parameter and s is the mean 
selective coefficient. We consider shape parameters of 
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P = 0.25, 0.5, 1, 2, which is similar to the range used by 
Charlesworth and Eyre-Walker [34]. Estimated values in 
the literature range from 0.23 [48] to 3.22 [9]. The mean 
strength of selection was set at 5 = 4.4 x 10 _1 ,8.5 x 
10~ 3 , 1.5 x 10~ 3 , 7.0 x 10~ 4 , each of which in combina- 
tion with the respective ji value above gives coq « 0.1 in 
the presence of linkage for u = 10 -6 . 

The shape parameter /3 controls the proportion of 
weakly deleterious mutations, and therefore the extent 
of hitch-hiking, and in combination with u, the amount of 
background selection. For small values of /J, the distribu- 
tion of selection coefficients is broadly distributed with a 
larger proportion of both nearly neutral and strongly dele- 
terious mutations; large values of /3 give a more strongly 
peaked DFE centred at nearly neutral to weakly delete- 
rious values. Background selection is primarily mediated 
by the deleterious mutations that are sufficiently weakly 
selected that they are able to persist to appreciable fre- 
quencies but accumulate to increase the extinction prob- 
ability of linked neutral and beneficial mutations. This 
range of selective coefficients is given approximately by 
0.5 < Ud/sd < 5 [38], where Ud is the genomic mutation 
rate at selected sites. Equating Ud with the genomic muta- 
tion rate gives a range of 6.7 x 10 -5 < Sd < 6.7 x 10 -4 
for u = 10 -6 , but Ud is generally smaller than U for finite 
values of /3. For /3 = 0.25, less than 5% of sites lie within 
this range so that strong negative selection dominates and 
most deleterious mutations are rapidly removed from the 
population. For u = 10 -5 , all mutations with 6.7 x 10 -4 < 
Sd < 6.7 x 10 -3 contribute to background selection, which 
covers the range around 1/N, so that much higher lev- 
els of background can be observed. Similarly, the extent of 
hitch-hiking is controlled by the proportion of sites with 
weak deleterious effects relative to the strength of positive 
selection, with the specific range varying according to the 
strength and prevalence of positive selection. 

Positive selection 

To examine the effect of linked positive selection, we 
introduce positive selection at a small number of codon 
sites in the sequence. Unlike negatively selected sites that 
individually have small effects but cumulatively can have 
a strong effect due to the large number of negatively 
selected sites, positive selection is expected to be rare, but 
a single site can have a strong effect. Thus we model all 
positively selected sites to have the same fixed selective 
effect Sb- 

At regular time intervals, we randomly choose a site and 
change the selective coefficient to s/, to generate recur- 
rent sweeps. This models a scenario of continuous positive 
selection, with beneficial mutations arising at different 
times. By varying the interval between each introduction 
of positive selection, we can model full selective sweeps 
that occur successively [43] or interfering sweeps [49]. 



Note that unlike coalescent simulations [43,49], we con- 
trol the rate at which beneficial mutations are introduced 
rather than the sweep rate. The selective sweep may occur 
considerably later than the time at which positive selection 
is introduced because genetic drift, background selection 
and hitch-hiking can affect the time required for beneficial 
mutations to reach establishment. 

For a low supply rate of beneficial mutations, we 
expect beneficial mutations to fix primarily in succes- 
sive sweeps with rare occurrences of clonal interference, 
whereas clonal interference will occur with high probabil- 
ity when the supply rate of beneficial mutations is high. 
The expected time for a beneficial mutation to become 
established in the population is given by t es t = l/(uNLt,Sb) 
[50]; after establishment, the beneficial mutation behaves 
almost deterministically, increasing rapidly in frequency 
and is expected to fix in tfi x = \og(Nsb)/sb generations 
[50]. For population size N = 10 4 and u = 10 -6 , a sin- 
gle beneficial mutation of strength Sb = 0.01 is expected 
to have establishment and fixation times of t es t ^ 2857 
and tf ix ^ 460 generations. To obtain a high supply rate 
of beneficial mutations, we introduce positive selection at 
high frequency, specifically at one site in every r = 1000 
generations, which is faster than the rate of establish- 
ment. For a low supply rate of beneficial mutations, we 
set x = 10000 generations, so that establishment and fix- 
ation of one beneficial mutation is likely to occur before 
a second positively selected site is introduced. Note that 
varying the timing of positive selection controls the supply 
rate of beneficial mutations (generally parameterised as 
UbN = uLbN) indirectly. After positive selection is intro- 
duced at a site, Lb is increased by one; however, Lb is also 
decreased when a beneficial mutation reaches fixation. 

Simulations under independently segregating sites 

To compare sequence statistics obtained under complete 
linkage with those obtained under the assumption of inde- 
pendently segregating sites, we simulate the number of 
polymorphic and divergent sites according to the Poisson 
Random Field (PRF) model [10]. The PRF model assumes 
a Wright-Fisher population at equilibrium with an infi- 
nite number of sites so that all new mutations occur on 
distinct sites. Under these assumptions, Sawyer and Hartl 
[10] showed that number of sites carrying a derived muta- 
tion follows a Poisson random field, with expectations that 
are functions of the mutation and selection parameters. 
We use the PRF as it is the basis of a number of infer- 
ence methods [6,9,10,41], and therefore provides a better 
reference than a finite-site model with independently seg- 
regating sites. 

In the PRF framework [10], the number of derived sites 
can be simulated as independent Poisson variables. We 
can then use the number of divergent and polymorphic 
sites to calculate sequence statistics a>, <Zmk and a? as 
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described in the main text. In the following section, we 
give the equations used to calculate the mean number of 
divergent and polymorphic sites. 

In the case where there is no positive selection, the 
expected number of synonymous and non-synonymous 
divergent sites, as described in Sawyer and Hartl [10], is 
given by 

E(k' s ) = u s Lt (5) 
E(k' A ) = u A Lt j w(-s d ,N)p(s d ,3,s), (6) 

where &>(.) is given by Equation (19), p(.) is the DFE, t 
is the divergence time, L is the length of sequence, us = 
u/(l + c) and u A = wc/(l + c). Using p(.) as given in 
Equation (3), this can be simplified to [34] 



E(k> A) = u A Lt P (^)\( P + l,^ s+ l) 



where 



£(s, a) 



i r 



t s - 



e flt (l - e" f ) 



dt 



(7) 



(8) 



denotes the Hurwitz zeta function which is provided in 
the GNU scientific library [51]. When Lj, > 0 sites are 
positively selected, we generate the number of divergent 
non-synonymous sites over the deleterious portion of the 
sequence using Equation (7) and the number of diver- 
gent beneficial sites is generated from a truncated Pois- 
son distribution with mean w^L^to^, AT), capped at Ly. 
This allows comparison with the finite sites model, which 
explicitly does not allow recurrent positive selection at a 
single site. 

The expected number of derived polymorphic sites with 
selection coefficient s segregating at frequency x in the 
population is given by [52] 



6<p(x,Ns) 



9 1 - 



x(l — x) 1 — e 



-2Ns 



(9) 



where 9 = 2uNL is the mutation input rate. For a sample 
of size n with a known ancestral sequence, the expected 
numbers of synonymous and non-synonymous polymor- 
phic sites observed at frequency i, as given in Sawyer and 
Hartl [10], are 

E(p s (i)) = 6 S ^ ^jx i (l-x) n - i (l>(x,0)dx (10) 



e_s 

i 



(11) 



/' 

Jo 



p (s d , p,s)4> (x, -Ns d )dsdx (12) 



where 9s = 2usNL and 9 A = 2u A NL. Applying the 
gamma DFE used in our model, Equation (12) can also be 
simplified in terms of the Hurwitz zeta function to give 



I b(x, i, n — 
Jo 



3 + 1, h * I dx. 

2Ns J 



where 



nx 

x,a,b)= I x"~ x (1 - x) h ~ x dx , 
Jo 



(13) 



(14) 



denotes the incomplete beta function. 

To calculate sequence statistics under assumptions of 
independently segregating sites, we sample the number 
of segregating synonymous and non-synonymous poly- 
morphisms from Poisson distributions characterised by 
Equations (11) and (13). The number of observed diver- 
gent sites is given by 

n-l 

k s = k' s +-Yi Ps {i) (15) 
a 

i=i 

n-l 

k A = k' A + -Yi PA {i) (16) 

1=1 

where k' s and k' A are Poisson random variables described 
by Equations (5) and (7). 

Selection statistics 

In each simulation, we randomly sample n = 100 
sequences every 2000 generations. Based on each sample 
and the known ancestral sequence, we then calculate the 
K A /K$ and MK statistics as follows. Let p A (i) denote the 
number of derived polymorphic codon sites that are non- 
synonymous (relative to the ancestral codon) and occur i 
times in the sample of size n = 100, and similarly, let ps(i) 
denote the number of derived synonymous polymorphic 
sites that occur i times. Multiple mutations at the same 
site are counted as distinct polymorphisms. The num- 
ber of synonymous divergent sites and non-synonymous 
divergent sites is given respectively by 



1 " 

ks = - Tripsin 



i=l 

n 



k A 



i=i 



(17) 



(18) 



The Ka/Ks statistic is given by [2], 

kA_ 
ck s ' 

The scaling factor c = 2.4 accounts for the fact that non- 
synonymous mutations are more likely than synonymous 



(19) 
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mutations due to the structure of the genetic code. It is 
calculated by summing across the substitution matrix, in 
our case the Kimura two-parameter model [47]. Standard 
methods [53] will automatically account for this scaling 
factor. Using this scaling, to can be interpreted as a func- 
tion of the strength of selection s and the population 
size AT, which under the assumptions of a Wright-Fisher 
population with independently segregating sites is given 
by [9] 



w(Ns) 



2Ns 



(20) 



This is obtained by taking the ratio between fixation prob- 
abilities of a selected and a neutral mutation [54]. In the 
case where positive selection is not operating, the value 
of to summed across the entire sequence is equal to the 
proportion of effectively neutral sites, denoted / [5]. 

We use a modification of the MK test [4] which provides 
a quantitative measure of adaptive substitution [6], 



o-mk = I<a - ks 



(21) 



The MK statistic does not require a scaling factor c, as 
it is given in units of the number of non-synonymous 
substitutions. The offset (+1) term in the denominator 
means that this estimator is defined in all cases. Smith and 
Eyre-Walker [6] found that the offset does not introduce 
noticeable bias. 

The ratio in Equation (21) is an estimator of f, under the 
assumption that all segregating polymorphisms are selec- 
tively neutral. This assumption is valid when selection 
is strong so that selected mutations immediately reach 
fixation or extinction, but not when weak selection is fre- 
quent. This problem is further compounded in the context 
of linked selection as linkage has the effect of weakening 
the effective strength of selection so that both deleteri- 
ous and beneficial mutations can potentially segregate for 
longer prior to extinction or fixation. Here, we examine 
two modifications of the MK statistic. 

The first is motivated by weakly deleterious mutations 
that segregate transiently in the population, which are 
known to inflate the estimate of selective constraint and 
cause underestimation of the number of adaptive substi- 
tutions [34]. To correct for this, we exclude low-frequency 
(< 0.15) derived polymorphisms from the analysis, fol- 
lowing Fay et al. [55], giving 



a-F = k A - k s 



T,"=l0.15n]PA(0 



E 



n-1 

i=[0.15«] 



Psd) 



(22) 



where the square brackets indicate rounding to the near- 
est integer. A further modification used by Bhatt et al. 
[33] is to exclude high-frequency polymorphisms which 
are likely to contain beneficial mutations and would, if 



included, lead to an overestimate of / and therefore under- 
estimation of the number of adaptive substitutions, 



a B = k A - k s 



W0.75h] ,., 
E i= [Q.15„]^(0 

[0.75b] . 



(23) 



Both ap and ag were developed to account for selected 
variation segregating in the population on the assump- 
tion of independently segregating sites. However, in the 
context of frequent selection, linkage between sites is also 
likely to have a strong effect, motivating us to consider 
the performance of these statistics. For comparison with 
the MK statistics, it is helpful to consider the performance 
of an estimator that does not use polymorphism informa- 
tion. Based on the to statistic, we estimate the number of 
adaptive substitutions using 



o-d = !<a - ck s coo . 



(24) 



In fact, «d is not a true estimator as coq is a fixed value 
(treating / as known) rather than a measurable quantity. 
Here, too is obtained using the median value of to based 
on simulations with linkage and the same values of 
and 5 but no positive selection (to 0 = 0.09, 0.09, 0.11, 0.12 
for u = 10" 6 and to 0 = 0.10,0.13,0.23,0.33 for u = 
10 -5 ). We used simulations rather than the theoretical 
expectation of / to account for background selection. In 
practice, too cannot be estimated from divergence infor- 
mation unless there is a period where it is known positive 
selection has not occurred. However, we use od, as it pro- 
vides a comparison showing how ap and as differ in their 
estimation of / 
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